Speech Recognition Front End Without Information Loss
Authors
Abstract
Phoneme classification is investigated in linear feature domains with the aim of improving robustness to additive noise. In linear feature domains noise adaptation is exact, which can lead to more accurate classification than representations that involve non-linear processing and dimensionality reduction. A generative framework is developed for isolated phoneme classification using linear features. Initial results are shown for representations built by concatenating f frames taken from the centre of the phoneme. Because phonemes have variable duration, no single value of f is optimal for all phonemes, so an average is taken over models with a range of values of f. Results are further improved by including information from the entire phoneme and from transitions. In the presence of additive noise, classification in this framework outperforms an analogous PLP classifier, adapted to noise using cepstral mean and variance normalisation, at SNRs below 18 dB. Finally, we propose classification using a combination of acoustic waveform and PLP log-likelihoods. The combined classifier performs uniformly better than either individual classifier across all noise levels.
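The claim that noise adaptation is exact in a linear feature domain can be illustrated with a minimal sketch. It is not the authors' model: it assumes one Gaussian per phoneme class and independent, zero-mean white Gaussian noise of known variance, and the function names (`gaussian_loglik`, `adapt_to_noise`, `classify`) are hypothetical. Under those assumptions the noisy feature vector is again Gaussian, with the noise covariance added to the clean class covariance, so no approximation is needed.

```python
import numpy as np


def gaussian_loglik(x, mean, cov):
    """Log-density of a multivariate Gaussian N(mean, cov) at x."""
    d = x.shape[0]
    diff = x - mean
    _, logdet = np.linalg.slogdet(cov)
    maha = diff @ np.linalg.solve(cov, diff)
    return -0.5 * (d * np.log(2.0 * np.pi) + logdet + maha)


def adapt_to_noise(cov, noise_var):
    """Exact adaptation for additive white Gaussian noise (assumed model).

    If x ~ N(mean, cov) and n ~ N(0, noise_var * I) are independent,
    then x + n ~ N(mean, cov + noise_var * I).
    """
    return cov + noise_var * np.eye(cov.shape[0])


def classify(x, means, covs, noise_var):
    """Return the index of the class with the highest adapted log-likelihood."""
    scores = [
        gaussian_loglik(x, m, adapt_to_noise(c, noise_var))
        for m, c in zip(means, covs)
    ]
    return int(np.argmax(scores))


# Toy two-class example with made-up parameters (illustration only).
means = [np.zeros(2), np.array([3.0, 3.0])]
covs = [np.eye(2), np.eye(2)]
noisy_x = np.array([2.9, 3.1])
predicted = classify(noisy_x, means, covs, noise_var=0.5)
```

In a non-linear domain such as cepstral features, the corresponding noisy-feature distribution generally has no such closed form, which is the contrast the abstract draws.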
Similar resources
SNR-based mask compensation for computational auditory scene analysis applied to speech recognition in a car environment
In this paper, we propose a computational auditory scene analysis (CASA)-based front-end for two-microphone speech recognition in a car environment. One of the important issues associated with CASA is the accurate estimation of mask information for target speech separation within multiple microphone noisy speech. For such a task, the time-frequency mask information is compensated through the si...
Noise robust hands-free speech recognition using microphone array and Kalman filter as front-end system of conversational TV
In this paper, we investigate hands-free speech recognition as the front-end system of a conversational TV. A conversational TV is a machine conversation system that retrieves information of interest in response to spoken queries addressed to the TV. To realize natural machine conversation in which the user need not be conscious of the microphone, hands-free speech recognition is required. In the hands-free speech recognition system,...
Performance improvement of a bitstream-based front-end for wireless speech recognition in adverse environments
In this paper, we propose a feature enhancement algorithm for wireless speech recognition in adverse acoustic environments. A speech recognition system is realized at the network side of a wireless communications system and feature parameters are extracted directly from the bitstream of the speech coder employed in the system, where the feature parameters are composed of spectral envelope infor...
A bitstream-based front-end for wireless speech recognition on IS-136 communications system
In this paper, we propose a feature extraction method for a speech recognizer that operates in digital communication networks. The feature parameters are basically extracted by converting the quantized spectral information of a speech coder into a cepstrum. We also include the voiced/unvoiced information obtained from the bitstream of the speech coder in the recognition feature set. We performe...
Recognizing voice over IP: a robust front-end for speech recognition on the world wide web
The Internet Protocol (IP) environment poses two relevant sources of distortion to the speech recognition problem: lossy speech coding and packet loss. In this paper, we propose a new front-end for speech recognition over IP networks. Specifically, we suggest extracting the recognition feature vectors directly from the encoded speech (i.e., the bit stream) instead of decoding it and subsequentl...
Spectral Features for Automatic Text-Independent Speaker Recognition
Front-end or feature extractor is the first component in an automatic speaker recognition system. Feature extraction transforms the raw speech signal into a compact but effective representation that is more stable and discriminative than the original signal. Since the front-end is the first component in the chain, the quality of the later components (speaker modeling and pattern matching) is st...
Journal title:
Volume, issue:
Pages: -
Publication date: 2013